Abstract. Recommendation systems have been integrated into the majority of
large online systems to filter and rank information according to user
profiles. They thus influence the way users interact with the system and, as
a consequence, bias the evaluation of the performance of a recommendation
algorithm computed using historical data (via offline evaluation).

This paper presents a new application of a weighted offline evaluation to 
reduce this bias for collaborative filtering algorithms. 

1 Introduction 

Recommendation systems have been studied extensively in the literature. They
aim to provide a user with a set of possibly ranked items that are supposed to
match the interests of the user. Applications of such systems are ubiquitous
on the Internet (e-commerce, online advertising, social networks, ...), and can
be seen as a way to adapt a system to a user.

Recommendation algorithms must be evaluated before and during their active use
in order to ensure their performance. Live monitoring is generally achieved
using online performance metrics (e.g. click-through rate of displayed ads)
whereas offline evaluation is computed using historical data. Offline
evaluation makes it possible to quickly test several strategies without having
to wait for real metrics to be collected and without impacting the performance
of the online system. One of the main strategies of offline evaluation consists
in simulating a recommendation by removing a confirmation action (click,
purchase, etc.) from a user profile and testing whether the item associated
with this action would have been recommended based on the rest of the
profile [7].

As presented in earlier work, this scheme ignores various factors that have
influenced the historical data, such as the recommendation algorithms
previously used, promotional offers on some specific products, etc. Even though
limits of evaluation strategies for recommendation algorithms have been
identified in the literature, this protocol is still intensively used in
practice.

We study in this paper the general principle of instance weighting proposed
in [1] and show its practical relevance beyond the simple case of constant
recommendation (i.e. when recommendations are the same for every user). In
addition to its good performance, this method is more realistic than solutions
that require a data collection phase based on random recommendations. While
such a phase allows one to build a bias-free evaluation data set, it also has
adverse effects in terms of e.g. public image or business performance when
used on a live system.

The rest of the paper is organized as follows. Section 2 describes in detail
the setting and the problem. Section 3 introduces the weighting scheme proposed
to reduce the evaluation bias. Section 4 demonstrates the practical relevance
of our method on real world data extracted from Viadeo (professional social
network)¹.

2 Problem formulation 

2.1 Notations and setting 

We denote by $U$ the set of users, by $I$ the set of items and by
$\mathcal{D}_t$ the historical data available at time $t$. A recommendation
algorithm is a function $g$ from $U \times \mathcal{D}_t$ to some set built
from $I$. We denote by $g_t(u) = g(u, \mathcal{D}_t)$ the recommendation
computed by $g$ at instant $t$ for user $u$. We assume given a quality function
$l$ from the product of the result space of $g$ and $I$ to $\mathbb{R}^+$ that
measures to what extent an item $i$ is correctly recommended by $g$ at time $t$
via $l(g_t(u), i)$. We denote by $I_u$ the items associated with a user $u$.

Offline evaluation is based on the possibility of “removing” any item $i$ from
a user profile. The result is denoted $u_{-i}$, and $g_t(u_{-i})$ is the
recommendation obtained at instant $t$ when $i$ has been removed from the
profile of user $u$.

Finally, offline evaluation follows a general scheme in which a user is chosen
according to some probability on users $P(u)$, which might reflect the business
importance of the users. Given a user, an item $i$ is chosen among the items
associated with their profile, according to some conditional probability on
items $P(i \mid u)$. When an item $i$ is not associated with a user $u$ (that
is $i \notin I_u$), $P(i \mid u) = 0$. A very common choice for $P(u)$ is the
uniform probability on $U$, and it is also very common to use a uniform
probability for $P(i \mid u)$ (other strategies could favor items recently
associated with a profile). As the system evolves over time, $P(u)$ and $P(i)$
depend on $t$.

The two distributions $P(u)$ and $P(i \mid u)$ lead to a joint distribution
$P(u, i) = P(i \mid u) P(u)$ on $U \times I$.

2.2 Origin of the bias in offline evaluation 

The classic offline evaluation procedure consists in calculating the quality of
the recommendation algorithm $g$ at instant $t$ as
$L_t(g) = \mathbb{E}\left(l(g_t(u_{-i}), i)\right)$, where the expectation is
taken with respect to the joint distribution:

$$L_t(g) = \sum_{(u,i) \in U \times I} P_t(i \mid u)\, P_t(u)\, l(g_t(u_{-i}), i). \tag{1}$$
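The sum in equation (1) can be computed exactly on a small data set. The sketch below is only an illustration under stated assumptions: uniform $P_t(u)$ and $P_t(i \mid u)$, a 0/1 quality function (hit or miss), and a hypothetical popularity-based recommender `most_popular`; the profiles and all function names are invented for the example.

```python
from typing import Callable, Dict, List, Set

Profiles = Dict[str, Set[str]]  # user -> items of the profile (I_u)
Recommender = Callable[[str, Profiles, int], List[str]]


def most_popular(user: str, profiles: Profiles, k: int) -> List[str]:
    """Toy recommender (hypothetical): the k most frequent items the user lacks."""
    counts: Dict[str, int] = {}
    for items in profiles.values():
        for item in items:
            counts[item] = counts.get(item, 0) + 1
    candidates = [i for i in counts if i not in profiles[user]]
    # Decreasing popularity, then alphabetical order to break ties.
    candidates.sort(key=lambda i: (-counts[i], i))
    return candidates[:k]


def offline_score(profiles: Profiles, recommend: Recommender, k: int = 1) -> float:
    """Exact L_t(g) of equation (1), assuming uniform P_t(u) and P_t(i|u)."""
    users = [u for u, items in profiles.items() if items]
    p_u = 1.0 / len(users)
    score = 0.0
    for u in users:
        p_i_given_u = 1.0 / len(profiles[u])
        for i in profiles[u]:
            held_out = dict(profiles)
            held_out[u] = profiles[u] - {i}  # profile u_{-i}
            # Assumed quality function: 1 if the removed item is recommended.
            hit = 1.0 if i in recommend(u, held_out, k) else 0.0
            score += p_u * p_i_given_u * hit
    return score


# Invented toy data, for illustration only.
profiles = {
    "ann": {"python", "sql"},
    "bob": {"python"},
    "cat": {"python", "java"},
}
score = offline_score(profiles, most_popular, k=1)
```

On this toy data only the removals of "python" are recovered by the popularity recommender, so the score is 2/3.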

Then if two algorithms are evaluated at two different moments, their qualities
are not directly comparable. Indeed, in an online system $P(i \mid u)$ evolves
over time²; in particular, once a recommendation algorithm is chosen based on a
given state of the system, it starts influencing the state of the system when
put in production, inducing an increasing distance between its evaluation
environment (i.e. the initial state of the system) and the evolving state of
the system. This influence is responsible for a bias in offline evaluation, as
the latter relies on historical data.

¹ See http://corporate.viadeo.com/en/ for more information about Viadeo.

A naive solution to this bias would be to compare algorithms only with
respect to the original database at $t_0$, but this would discard natural
evolutions of user profiles.

3 Reducing the evaluation bias 

3.1 A suggested method to reduce the bias 

A simple transformation of equation (1) shows that for a constant algorithm
$g$: $L_t(g) = \sum_{i \in I} P_t(i)\, l(g_t, i)$. As a consequence, a way to
guarantee a stationary evaluation framework for a constant algorithm is to have
constant values for the marginal distribution of items, $P_t(i)$.
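The transformation mentioned above can be written out as follows. This is a sketch that assumes the marginal item distribution is obtained by summing the joint distribution over users; for a constant algorithm the recommendation does not depend on the user profile, so $g_t(u_{-i}) = g_t$.

```latex
\begin{align*}
L_t(g) &= \sum_{(u,i) \in U \times I} P_t(i \mid u)\, P_t(u)\, l(g_t, i)
  && \text{(equation (1) with } g_t(u_{-i}) = g_t\text{)} \\
&= \sum_{i \in I} l(g_t, i) \sum_{u \in U} P_t(i \mid u)\, P_t(u) \\
&= \sum_{i \in I} P_t(i)\, l(g_t, i),
  \qquad \text{where } P_t(i) = \sum_{u \in U} P_t(i \mid u)\, P_t(u).
\end{align*}
```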

A natural solution would be to record those probabilities at $t_0$ and use them
as the probability to select an item in offline evaluation at $t_1 > t_0$.
However, as the selection of users and items leads to a joint distribution,
this would require reversing the way offline evaluation is done: first select
an item, then select a user having this item with a certain probability
$\pi_t(u \mid i)$, leading to a different probability of user selection. This
process thus leads to a similar problem on users and, as in most systems
$\#U > \#I$, it is more efficient to follow the classical evaluation protocol.

Moreover, we will see that the recalibration of every item is not necessary
to reduce the main part of the bias. Indeed, in practice a few items usually
concentrate most of the recommendations (very popular items, discounts on
selected products, ...). Thus one can reduce the major part of the bias by
optimizing the weights of the $p$ items for which the deviation
$|P_{t_1}(i) - P_{t_0}(i)|$ is the largest. In practice $p$ is chosen according
to practical constraints (time) or business constraints.

Thus the weighting strategy that we described in [1] consists in keeping the
classical choice for $P_t(u)$ and departing from the classical values for
$P_t(i \mid u)$ (such as a uniform probability) by weighting them, in order to
mimic static values for $P_t(i)$:

$$P_t(i \mid u, \omega) = \frac{\omega_i P_t(i \mid u)}{\sum_{j \in I} \omega_j P_t(j \mid u)}. \tag{2}$$


These weighted conditional probabilities lead to weighted item probabilities
defined by:

$$P_t(i \mid \omega) = \sum_{u \in U} P_t(i \mid u, \omega)\, P_t(u). \tag{3}$$
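The weighted probabilities of equations (2) and (3) can be illustrated on toy data. The sketch below assumes that each weight multiplies $P_t(i \mid u)$ and that the result is renormalized over the items of the user, that $P_t(u)$ and the unweighted $P_t(i \mid u)$ are uniform, and that items without an explicit weight keep a weight of 1. The function names and profiles are hypothetical.

```python
from typing import Dict, Set

Profiles = Dict[str, Set[str]]  # user -> items of the profile (I_u)


def weighted_conditional(items_u: Set[str], weights: Dict[str, float]) -> Dict[str, float]:
    """P_t(i|u,w) as in equation (2), starting from a uniform P_t(i|u)."""
    base = 1.0 / len(items_u)
    # Assumption: items without an explicit weight keep a weight of 1.
    unnormalized = {i: weights.get(i, 1.0) * base for i in items_u}
    total = sum(unnormalized.values())
    return {i: value / total for i, value in unnormalized.items()}


def weighted_marginal(profiles: Profiles, weights: Dict[str, float]) -> Dict[str, float]:
    """P_t(i|w) as in equation (3), with a uniform P_t(u)."""
    users = [u for u, items in profiles.items() if items]
    p_u = 1.0 / len(users)
    marginal: Dict[str, float] = {}
    for u in users:
        for i, p in weighted_conditional(profiles[u], weights).items():
            marginal[i] = marginal.get(i, 0.0) + p * p_u
    return marginal


# Invented toy data, for illustration only.
profiles = {
    "ann": {"python", "sql"},
    "bob": {"python"},
    "cat": {"python", "java"},
}
unweighted = weighted_marginal(profiles, {})
downweighted = weighted_marginal(profiles, {"python": 0.5})
```

Halving the weight of "python" lowers its marginal probability from 2/3 to 5/9 here, while the marginal probabilities still sum to one.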

² Even if $P(u)$ could also evolve over time, we do not consider the effects of
such evolution in the present article.




We then bring $P_t(i \mid \omega)$ close to $P_{t_0}(i)$ by minimizing, with
respect to the weights $\omega$, the Kullback-Leibler divergence, defined by:

$$D\left(P_{t_0} \,\|\, P_t(\cdot \mid \omega)\right) = \sum_{i \in I_{t_0}} P_{t_0}(i) \log \frac{P_{t_0}(i)}{P_t(i \mid \omega)},$$

where $I_{t_0}$ represents the set of items present at $t_0$. The asymmetric
nature of this distance is useful in our context to consider time $t_0$ as a
reference. Moreover this asymmetry reduces the influence of rare items at time
$t_0$ (as they were not very important in the calculation of $L_{t_0}(g)$).
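The following sketch illustrates the selection of the $p$ most deviating items and the fitting of their weights. The text does not specify an optimizer, so a simple greedy multiplicative search is used here as a stand-in; it is not the authors' procedure. Uniform $P_t(u)$ and $P_t(i \mid u)$, a default weight of 1 for non-selected items, and all names and data are assumptions.

```python
import math
from typing import Dict, Set

Profiles = Dict[str, Set[str]]  # user -> items of the profile (I_u)


def marginal(profiles: Profiles, weights: Dict[str, float]) -> Dict[str, float]:
    """P_t(i|w): uniform P_t(u), weighted and renormalized uniform P_t(i|u)."""
    users = [u for u, items in profiles.items() if items]
    result: Dict[str, float] = {}
    for u in users:
        total = sum(weights.get(j, 1.0) for j in profiles[u])
        for i in profiles[u]:
            result[i] = result.get(i, 0.0) + weights.get(i, 1.0) / total / len(users)
    return result


def kl(reference: Dict[str, float], current: Dict[str, float]) -> float:
    """Kullback-Leibler divergence with the t_0 distribution as reference."""
    total = 0.0
    for i, p in reference.items():
        q = current.get(i, 0.0)
        if p > 0.0 and q == 0.0:
            return math.inf
        if p > 0.0:
            total += p * math.log(p / q)
    return total


def fit_weights(profiles_t1: Profiles, reference: Dict[str, float],
                p: int, sweeps: int = 30) -> Dict[str, float]:
    """Greedy search (a stand-in optimizer) over the weights of p items."""
    base = marginal(profiles_t1, {})
    # The p items whose marginal probability deviates most from the reference.
    ranked = sorted(base, key=lambda i: (-abs(base[i] - reference.get(i, 0.0)), i))
    weights = {i: 1.0 for i in ranked[:p]}
    best = kl(reference, marginal(profiles_t1, weights))
    for _ in range(sweeps):
        for i in list(weights):
            for factor in (0.5, 0.9, 1.1, 2.0):
                trial = dict(weights)
                trial[i] = weights[i] * factor
                score = kl(reference, marginal(profiles_t1, trial))
                if score < best:  # keep a change only if it lowers the divergence
                    weights, best = trial, score
    return weights


# Invented toy data: "java" is pushed to every user between t_0 and t_1.
profiles_t0 = {"ann": {"python", "sql"}, "bob": {"python"}, "cat": {"python", "java"}}
profiles_t1 = {"ann": {"python", "sql", "java"}, "bob": {"python", "java"},
               "cat": {"python", "java"}}
reference = marginal(profiles_t0, {})
weights = fit_weights(profiles_t1, reference, p=1)
```

With `p=1` the search selects "java", the item whose marginal probability moved the most, and lowers its weight below 1. A gradient-based optimizer could replace the greedy loop without changing the rest of the sketch.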

3.2 Previous results 

As described in [1], the score given by the classical offline evaluation to an
algorithm in production tends to increase over time. More generally, the
classical offline evaluation tends to overestimate (resp. underestimate) the
unbiased score of an algorithm similar (resp. orthogonal) to the one in
production.

We have also shown in [1] that the suggested weighting strategy perfectly
recalibrates the score obtained by the classical offline evaluation for
constant algorithms and high values of $p$. Thus, this method seems to reduce
the bias for the very simple class of constant algorithms.

In the next part we apply this method to collaborative filtering algorithms. 

4 Experiments on collaborative filtering

4.1 Data and metrics 

We consider real world data extracted from Viadeo, where skills are attached to
users' profiles. The objective of the recommendation system is to suggest new
skills to users. The dataset contains 18294 users and 180 items (skills),
leading to 117376 couples $(u, i)$.

Both probabilities $P_t(u)$ and $P_t(i \mid u)$ are uniform, and the quality
function $l$ is given by $l(g_t(u_{-i}), i) = \mathbb{1}_{i \in g_t(u_{-i})}$,
where $g_t(u_{-i})$ is a set of 5 items. The quality of a recommendation
algorithm, $L_t(g)$, is estimated via stochastic sampling in order to simulate
what could be done on a larger data set than the one used for this
illustration. We repeatedly selected 20 000 couples (user, item): we first
select a user $u$ uniformly, then an item according to $P_t(i \mid u, \omega)$.
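The sampling procedure described above can be sketched as follows. This is an illustration under stated assumptions rather than the exact experimental code: the quality function is a 0/1 hit indicator, items without an explicit weight have weight 1, and the recommender is passed as a hypothetical callable that only sees the held-out profile.

```python
import random
from typing import Callable, Dict, Optional, Set

Profiles = Dict[str, Set[str]]  # user -> items of the profile (I_u)


def sample_score(profiles: Profiles,
                 recommend: Callable[[Set[str]], Set[str]],
                 weights: Optional[Dict[str, float]] = None,
                 n_samples: int = 20000,
                 seed: int = 0) -> float:
    """Monte Carlo estimate of L_t(g) from sampled (user, item) couples."""
    weights = weights or {}
    rng = random.Random(seed)
    users = sorted(u for u, items in profiles.items() if items)
    hits = 0
    for _ in range(n_samples):
        u = rng.choice(users)  # u drawn uniformly, i.e. from P_t(u)
        items = sorted(profiles[u])
        item_weights = [weights.get(i, 1.0) for i in items]
        # i drawn from P_t(i|u,w); random.choices normalizes the weights.
        i = rng.choices(items, weights=item_weights, k=1)[0]
        held_out = profiles[u] - {i}  # profile u_{-i}
        if i in recommend(held_out):
            hits += 1
    return hits / n_samples


# Invented toy data and a constant recommender, for illustration only.
profiles = {
    "ann": {"python", "sql"},
    "bob": {"python"},
    "cat": {"python", "java"},
}
estimate = sample_score(profiles, lambda held_out: {"python"})
```

For this constant recommender the exact score equals the marginal probability of "python", 2/3, so the estimate should be close to that value.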

4.2 Collaborative filtering algorithms 

Let $X_{u,t}$ be the vector of items of user $u$ at time $t$
($X_{u,t} \in \{0,1\}^{\#I}$). $X_{u,t}$ is a sparse vector, as most users are
associated with only a few items. The objective of collaborative filtering
algorithms is to estimate $X_{u,t'}$ for $t' > t$ using the information known
about other users. In this paper we present two different collaborative
filtering algorithms:
Equation a) is known as collaborative filtering with cosine similarity,
whereas equation b) computes the proportion of users associated with item $i$
among those associated with items possessed by $u$. We refer to algorithm b)
as naive CF (Collaborative Filtering).

Finally, the recommendation strategy consists in recommending the $k$ items
having the highest values in $X_{u,t'}$.
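As an illustration of the top-$k$ strategy, the sketch below scores items with a common user-based formulation of cosine-similarity collaborative filtering. It is not necessarily identical to equation a): each candidate item is assumed to receive the sum of the cosine similarities between the target user and the other users who hold it. Names and data are hypothetical.

```python
import math
from typing import Dict, List, Set

Profiles = Dict[str, Set[str]]  # user -> items of the profile (binary X_u)


def cosine(a: Set[str], b: Set[str]) -> float:
    """Cosine similarity between two binary item vectors stored as sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


def recommend_cosine(user: str, profiles: Profiles, k: int = 5) -> List[str]:
    """Top-k items by similarity-weighted votes of the other users."""
    scores: Dict[str, float] = {}
    for other, items in profiles.items():
        if other == user:
            continue
        sim = cosine(profiles[user], items)
        if sim == 0.0:
            continue  # users with nothing in common do not vote
        for item in items:
            if item not in profiles[user]:
                scores[item] = scores.get(item, 0.0) + sim
    # Highest score first, then alphabetical order to break ties.
    ranked = sorted(scores, key=lambda i: (-scores[i], i))
    return ranked[:k]


# Invented toy data, for illustration only.
profiles = {
    "ann": {"python", "sql"},
    "bob": {"python", "java"},
    "cat": {"sql", "excel"},
    "dan": {"python", "sql", "java"},
}
top = recommend_cosine("ann", profiles, k=5)
```

Here "java" is ranked first for "ann" because it is held by the two users most similar to her, followed by "excel".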


4.3 Results 

We apply the method described in Section 3 to compute optimal weights at
different instants and for several values of the parameter $p$. The
collaborative filtering algorithms are the ones presented in Section 4.2.
Results are summarized in Figure 1.




Fig. 1: Results on collaborative filtering with (a) cosine similarity and
(b) naive CF, respectively defined by equations a) and b) in Section 4.2, for
several values of $p$ (the number of weights optimized).

The analysis is conducted on a 201-day period, from day 300 to day 500,
where day 0 corresponds to the launch date of the skill feature. It is
important to note that two recommendation campaigns were conducted by Viadeo
during this period, at $t = 330$ and $t = 430$ respectively. As can be seen in
Figure 1, the scores strongly decrease after the first recommendation campaign
($t = 330$). Thus those campaigns have strongly biased the collected data,
leading to a significant bias in the offline evaluation score.

Figure 1 shows the influence of the value of $p$: the higher $p$ is, the more
weights are optimized and the more the bias is corrected. However, the
efficiency of the recalibration depends on the algorithm. The results show
that the weighting protocol reduces the impact of recommendation campaigns on
offline evaluation results, as intended. However, it does not lead to the
stationarity of the score of collaborative filtering algorithms (while it
leads to constant scores for constant algorithms). This can be explained by
the nature of collaborative filtering: we cannot expect the score to be
constant for such an algorithm, as it depends on the correlations between
users, which have been modified by the recommendation campaigns.

5 Conclusion 

Various factors influence historical data and bias the score obtained by the
classical offline evaluation strategy. Indeed, as recommendations influence
users, a recommendation algorithm in production tends to be favored by offline
evaluation.

We have presented a new application of the item weighting strategy inspired
by techniques designed for tackling the covariate shift problem. Whereas our
previous results established the efficiency of this method for constant
algorithms, we have shown that this method also reduces the bias for more
elaborate algorithms.

However, the efficiency of this approach depends on the algorithm, as a
recommendation campaign also introduces bias in the correlations between
users. Thus the presented strategy reduces a part of the bias, and future work
will focus on the structural bias introduced by recommendation campaigns.